
Conversation

Collaborator

@Kangyan-Zhou Kangyan-Zhou commented Nov 1, 2025

Add an option to deploy the model agent only to GPU nodes, so that we don't waste resources when there is no CPU serving.

Add option to use Ubuntu 22.04 instead of the default Oracle Linux.

Add cache for build to reduce build time across the board, especially xnet.
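The caching and base-image changes described above might look like the following Dockerfile sketch. This is illustrative only: the build path ./cmd/model-agent, the Go version, the default base image tag, and the installed packages are assumptions rather than details taken from this PR; only the BASE_IMAGE build argument and the BuildKit cache-mount technique come from the description.

```dockerfile
# Sketch of the caching pattern; requires Docker BuildKit.
# BASE_IMAGE comes from the PR; its default tag here is an assumption.
ARG BASE_IMAGE=oraclelinux:9

FROM golang:1.22 AS builder
WORKDIR /src
COPY go.mod go.sum ./
# Cache mounts persist the Go module and build caches across builds,
# so unchanged dependencies are not re-downloaded or re-compiled.
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -o /out/model-agent ./cmd/model-agent

FROM ${BASE_IMAGE}
# The two supported bases use different package managers.
RUN if command -v microdnf >/dev/null 2>&1; then \
        microdnf install -y ca-certificates && microdnf clean all; \
    else \
        apt-get update \
        && apt-get install -y --no-install-recommends ca-certificates \
        && rm -rf /var/lib/apt/lists/*; \
    fi
COPY --from=builder /out/model-agent /usr/local/bin/model-agent
ENTRYPOINT ["/usr/local/bin/model-agent"]
```

Building with `--build-arg BASE_IMAGE=ubuntu:22.04` selects the Ubuntu base; omitting it keeps the Oracle Linux default.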

@gemini-code-assist
Contributor

Summary of Changes

Hello @Kangyan-Zhou, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new capability to deploy the ome model agent specifically onto Kubernetes nodes equipped with GPUs, providing finer control over deployment targets. Simultaneously, it significantly optimizes the build process for various components by integrating advanced caching mechanisms for Go and Docker, which will lead to faster and more efficient builds. Additionally, it enhances the flexibility of Dockerfiles by allowing configurable base images and improves error reporting in a utility function.

Highlights

  • GPU Node Deployment for Model Agent: Introduced a new option to deploy the model agent exclusively to GPU-enabled Kubernetes nodes, allowing for more precise resource allocation. This is controlled via modelAgent.gpuNodesOnly and modelAgent.gpuNodeLabel in values.yaml.
  • Comprehensive Build Caching Improvements: Implemented significant enhancements to Go and Docker builds across multiple components by leveraging Docker BuildKit's cache mounts for Go modules and build caches. This includes updates to .gitignore, Makefile, and all relevant Dockerfiles.
  • Dockerfile Base Image Flexibility: The model-agent and ome-agent Dockerfiles now support a configurable base image (BASE_IMAGE build argument) and include conditional logic to adapt package installations for both microdnf (Oracle Linux) and apt-get (Ubuntu/Debian) based systems.
  • Enhanced Error Reporting: Improved the SnapshotDownload function to collect and report detailed error messages alongside file paths when errors occur during the download process, aiding in debugging.
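For illustration, the GPU-only knob from the first highlight might be wired up as follows. The key names modelAgent.gpuNodesOnly and modelAgent.gpuNodeLabel come from this PR, but the default label value and the DaemonSet template wiring shown here are assumptions:

```yaml
# values.yaml (sketch)
modelAgent:
  gpuNodesOnly: true                       # deploy the model agent only to GPU nodes
  gpuNodeLabel: "nvidia.com/gpu.present"   # assumed default; set to your cluster's GPU label

# templates/model-agent-daemonset.yaml (sketch of the conditional selector)
#   {{- if .Values.modelAgent.gpuNodesOnly }}
#   nodeSelector:
#     {{ .Values.modelAgent.gpuNodeLabel }}: "true"
#   {{- end }}
```

With gpuNodesOnly disabled, no nodeSelector is rendered and the agent schedules onto all nodes as before.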
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Kangyan-Zhou Kangyan-Zhou changed the title Ome agent gpu Make OME Deployment Flexible with GPU; Improve Build Cache Nov 1, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two main improvements: flexible deployment of the model agent to GPU nodes and enhanced build caching for Docker and Go. The changes to the Helm charts add valuable flexibility for GPU-specific deployments, and the Dockerfile and Makefile modifications significantly improve build times by leveraging caching. I've flagged the Helm template hub-override logic for a closer look, along with a couple of medium-severity suggestions for the Makefile and error handling.

      containers:
        - name: model-agent
-         image: {{ include "ome.imageWithHub" (dict "values" .Values "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}
+         image: {{ include "ome.imageWithHub" (dict "values" (merge (dict "global" (dict "hub" (default .Values.global.hub .Values.modelAgent.image.hub))) .Values) "repository" .Values.modelAgent.image.repository "tag" .Values.modelAgent.image.tag) }}

critical

Careful with the Sprig semantics here:

  1. default takes its fallback first (default <fallback> <value>), so (default .Values.global.hub .Values.modelAgent.image.hub) already resolves to the model-agent hub when it is set, falling back to the global hub otherwise, as documented in values.yaml.
  2. merge gives precedence to the left-most (dest) dictionary, so the override dict must come before .Values; writing merge .Values (dict ...) would let the existing .Values.global.hub win and discard the override.

As written, the new line therefore behaves as intended; swapping either pair of arguments would break the model-agent hub override.

export GOMODCACHE

# Ensure cache directories exist
$(shell mkdir -p $(GOCACHE) $(GOMODCACHE))

medium

The $(shell mkdir -p ...) command is executed every time make is invoked, during the parsing phase. While mkdir -p is idempotent, this is generally not a good practice as it can have side effects and is not explicit. It's better to create directories as part of a target's recipe or as a prerequisite for targets that need them. For example, you could create a .PHONY setup target and add it as a dependency to your build targets.
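Concretely, the suggested pattern could look like the following Makefile sketch. The order-only prerequisites (to the right of the |) make the directories a precondition of build without triggering rebuilds when their timestamps change; variable and target names here are illustrative:

```makefile
GOCACHE    ?= $(CURDIR)/.cache/go-build
GOMODCACHE ?= $(CURDIR)/.cache/go-mod
export GOCACHE GOMODCACHE

# Created only when absent, not on every make invocation.
$(GOCACHE) $(GOMODCACHE):
	mkdir -p $@

# Order-only prerequisites: build depends on the directories existing,
# but not on their modification times.
build: | $(GOCACHE) $(GOMODCACHE)
	go build ./...
```

This keeps directory creation explicit in the dependency graph instead of running as a side effect of parsing.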

if totalErrors > 0 {
// Collect error details
var errorFiles []string
var errorMessages []string

medium

You've added errorMessages to collect more detailed errors, which is a great improvement for debuggability. However, this new slice is populated but never used later in the function. The function still prints the less informative errorFiles slice and returns a generic error message. Consider using errorMessages to provide more detailed feedback to the user on failure, for example by joining them into the returned error string or logging them.

@Kangyan-Zhou Kangyan-Zhou changed the title Make OME Deployment Flexible with GPU; Improve Build Cache Make OME Deployment Flexible with GPU only option; Improve Build Cache and Docker Build Nov 1, 2025